Skip to content

Conversation

@westonpace
Copy link
Member

The post join filter has very little explanation. It can also be confusing because, from a purely logical perspective, it is possible to see the post join filter as redundant. This PR attempts to clarify the description of the post join filter.

| Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional |
| Post-Join Filter | A boolean condition to be applied to each potential match between the left and right
inputs. If it evaluates to false then the potential match is not considered a match. A join relation with
Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y. | Optional |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Much better than my local draft! :)

Two more things.

  • Align Hash/MergeJoin post-join filter description with this. We could refer JoinRel there and leave what's different.
  • Should this be Optional, default True like hash/merge join?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align Hash/MergeJoin post-join filter description with this. We could refer JoinRel there and leave what's different.

Can you expand on what you mean here? The PR does currently update the hash/merge join descriptions. I don't include the A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y statement because this is not true for hash/merge join (the join expression for these relations is a series of equality conditions).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the language of the description. The way you describe is more explicit that the post_join_filter IS part of the join condition, say saying what "matches" and what "does not match". This is not for try to reduce the output.

Also, can we drop Equi form HashEquiJoin? :)

@drin
Copy link
Member

drin commented Apr 23, 2025

A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y

Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, I think it'd be nice to add a comment in the .proto file to the effect of "post_join_filter should be resolved in conjunction (AND) with expression."

Comment on lines +258 to +260
// The post-join filter is a filter that is applied to the result of the join before an output
// record is produced. If the filter evaluates to false then the record is not considered a
// match.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ambiguous for functions that aggregate over many tuples. I think a "simple" example is:

  1. the post_join_filter is a comparison (lte) that uses a window function (count)
  2. the expression is a predicate with selectivity between 0 and 100%
  3. the expression produces many tuples from one input for a single tuple of the other input

(2) and (3) are necessary for ambiguous scenarios to occur and (1) is where the ambiguity is expressed.

Copy link
Member

@drin drin Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, I think "applied to the result of the join before an output record is produced" lends itself to being misunderstood because the "result" of the join sounds like the result of applying expression, but I think to be accurate to "equivalent to a join relation with Join Expression X AND Y" you must evaluate post_join_filter on the inputs to expression even if you only evaluate its "truthyness" on joined records that expression evaluates as true.

Maybe something more like "applied to the inputs of the join, before an output record is produced" is better and equally concise?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it's better to clarify the predicates are evaluated over the inputs. Like @drin 's suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to "evaluated over the inputs". As for when it's applied, I'm still not too sure about what is the supposed behavior tbh. Let's say you're joining two tables a LEFT JOIN b ON ... with a post-join filter that has a.Col1 = b.Col2. Is a.Col1 = b.Col2 expression also supposed to follow join type semantics and leave the unmatched records from the left side in the output? Or will the result be as if it had been an inner join instead of a left join?

Copy link
Member Author

@westonpace westonpace Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get very confused easily when talking about when this filter is applied. Here is my understanding, in naive pseudocode, of how it is applied. I'm omitting right joins, full outer joins, single joins, and mark joins for simplicity.

for left_record in left_records:
  has_match = False
  for right_record in right_records:
    if join_expression(left_record, right_record) and post_join_filter(left_record, right_record):
      has_match = True
      if join_type == Inner or join_type == Left:
        emit(combine(left_record, right_record):
  if has_match and join_type == LeftSemi:
    emit(left_record)
  elif not has_match and join_type == LeftAnti:
    emit(left_record)
  elif not has_match and join_type == Left:
    emit(combine(left_record, null))

If someone has an alternate proposal, is it possible to share your own pseudocode representation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tokoko the predicates in join condition does not follow join type. The join type becomes into play depending on whether there is a matching row (i.e., intersection) or not. outer joins and antisemi joins should ensure that you have no rows that matches according to JoinRel.expression AND JoinRel.post_join_filter to correctly behave (i.e., whether to produce null padded rows (outer) or include in the output (antisemi)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. For JoinRel, can't we write something like "post-join filter is supposed to be evaluated as if it's part of the join expression" or something similar? It would be a lot simpler to understand imho rather than thinking through when during the operation it's supposed to be applied/evaluated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tokoko that's why I initially proposed to drop post_join_filter from JoinRel in the slack discussion. :)

Copy link
Member

@drin drin Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that pseudocode weston. My intention is to make the wording clearly reflect that post_join_filter(left_record, right_record) is valid and post_join_filter(combine(left_record, right_record)) is invalid.

Note (for completeness) that my naive reading of "post join" was incorrect and would have been to implement:

# above pseudocode here
...
if post_join_filter(emitted_record):
  really_emit(emitted_record)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup 'post_join' really tripped me and reason i started the thread. In the systems i worked, used residual join condition/predicate rather than post.

@jacques-n
Copy link
Contributor

It's difficult to follow the threads in this discussion.

One can think of a a join with a post join filter as a composite operation that is a join followed by a filter relation. It is entirely valid translation to take a post join filter out of the join and put in a filter relation directly afterwards and vice versa.

The post join filter does not logically interact with the join type at all. The composite exists because many systems have it and it can be a beneficial physical pattern. The reason it has to be stated separately from the join predicate is to have covering behavior of all possible filter conditions. I always have to remind myself of which conditions can and cannot be moved into a join evaluation clause.

I'm supportive of clarifying the text if people are unclear as to what post join filter means.

@westonpace
Copy link
Member Author

westonpace commented May 5, 2025

One can think of a a join with a post join filter as a composite operation that is a join followed by a filter relation. It is entirely valid translation to take a post join filter out of the join and put in a filter relation directly afterwards and vice versa.

@jacques-n

This is not the conclusion we came to. I believe the content of the PR is still accurate with the threads, so you can just review the content and ignore the discussion.

For example:

SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.other_field IS NOT NULL

As it stands this filter will emit one row for each row in a. If the filter is moved into the WHERE clause then it will emit fewer rows (less or equal to the number of rows emitted by an inner join).

From your GPT link this matches:

Conditions on the non-preserved side that would otherwise eliminate rows that should remain when there's no match

@drin
Copy link
Member

drin commented May 5, 2025

There is a thread that discusses the description that would go in the website: discussion on website description

I think further discussion on this can be deferred until agreement on post_join_filter semantics is finalized.

Then, there's a thread discussing the comment in the .proto file for post_join_filter: discussion on comment in spec

This discussion assumes that Weston's assertion in the description is the correct semantics of post_join_filter. As Weston points out, that description is in contradiction to Jacques's comment.

This is not the conclusion we came to

In this PR, we never collectively discussed what it should be versus what it is. The description says:

The post join filter has very little explanation... from a purely logical perspective, it is possible to see the post join filter as redundant.

And I asked:

Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, ...

One thing that was referenced in slack is the substrait FAQ: "The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join." This FAQ then references velox hash-join implementation, which says: "Filter is optional. If specified it can be any expression over the results of the join."

It occurs to me that the FAQ says "post-join filter... is not always equivalent to an explicit Filter relation AFTER the join," yet the referenced velox documentation says "If specified, it can be any expression over the results of the join." These seem directly contradictory to me, since "the results of of the join" sounds quite a bit like "AFTER the join".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants